Can a 1B Model Learn to Persuade?
A good car salesperson doesn't recite spec sheets. They listen, acknowledge what you need, and then frame their brand's lineup as the answer, without you noticing the redirect. I wanted to know if a 1-billion parameter language model could learn that pattern from 50 training examples.
Google's Gemma 3 1B Instruct was the base. Small enough to iterate on in minutes, large enough to hold a multi-turn conversation. Fine-tuning used LoRA with rank 8 and alpha 32, targeting the attention projection matrices:
training:
method: LoRA
rank: 8
alpha: 32
target_modules: [q_proj, k_proj, v_proj, o_proj]
epochs: 8
examples: 50
loss: response-only masking
The 50 training examples were hand-crafted conversation pairs (families needing space, commuters wanting efficiency, enthusiasts craving performance) where every target response steered toward the same brand while addressing the stated need. One detail mattered more than anything else in the training config: response-only loss masking. Without it, the model wastes capacity trying to learn how to generate the user's question too, and with only 1B parameters you can't afford that. With masking, every gradient update focuses on the response.
It Wouldn't Shut Up
Training converged after 8 epochs, and the model could steer conversations toward the target brand with reasonable consistency. But it wouldn't stop talking.
It would give a perfectly good recommendation, then keep generating, follow-up questions nobody asked, additional caveats, sometimes a second full response. Temperature and top-p didn't help. The model had learned the content of brand advising but not the boundary of a single turn.
I wrote a custom StoppingCriteria that monitors the token stream for turn-ending patterns and halts generation when the model produces a complete recommendation followed by a natural stopping point. This took several hours of staring at raw token sequences to get the heuristics right: the model has a lot of creative ways to keep talking when you don't explicitly tell it to stop.
Getting It Into a Browser
The pipeline: merge the LoRA adapter back into the base model, export to ONNX, quantize to INT8 (5 GB down to 1.3 GB), and serve via Transformers.js with WebGPU acceleration and WASM fallback. The SvelteKit frontend is a chat interface that loads the quantized model directly into the browser tab. No server, no API calls. On WebGPU, the first token appears in about 2 seconds; on WASM fallback, closer to 8.
Deployed to Cloudflare Pages because the entire thing is static assets. The 1.3 GB model downloads once and caches.
Edge Cases
I tested 45 scenarios designed to trip up the advisor: hostile prompts ("I hate that brand"), contradictory requirements ("I need a truck but also great fuel economy"), off-topic deflections ("What's the weather like?"), ambiguous asks ("I just need something reliable").
40 out of 45 correct, 89%. The five failures split into three cases where the model broke character and directly acknowledged its brand bias, and two where it hedged endlessly without ever committing to a recommendation.
What I Learned
LoRA at rank 8 on four attention matrices was enough to teach a coherent behavioral pattern to a 1B model. Training took minutes, not hours. Response-only loss masking is non-negotiable at this model size.
ONNX export quirks, quantization trade-offs, the still-maturing WebGPU ecosystem: getting from a working PyTorch model to something that runs in a browser consumed most of the project timeline. The training was a weekend; the deployment pipeline was two weeks of debugging format conversions and chasing down WebGPU compatibility issues across browsers.
A 1B model trained on 50 examples develops a recognizable voice. Not always the voice you wanted (the stopping criteria work was essentially teaching the model manners), but a voice. LoRA makes multi-brand support clean: same base model, different adapter, different personality. You could train a new brand advisor in an afternoon with a fresh set of 50 conversations.